Reducing Latency of LLM Search Agent via Speculation-based Algorithm-System Co-Design
Huang, Zixiao, Zeng, Wen, Fu, Tianyu, Liu, Tengxuan, Sun, Yizhou, Hong, Ke, Yang, Xinhao, Liu, Chengchun, Li, Yan, Zhang, Quanlu, Dai, Guohao, Zhu, Zhenhua, Wang, Yu
LLM-based search agents achieve strong performance but suffer from severe latency, as each step requires serialized LLM reasoning followed by tool execution. We revisit this bottleneck through the lens of speculation. While the traditional predict-verify speculation paradigm can break serial execution, its benefit remains limited because it retains the full original workload and adds extra inference overhead. We observe that early agent steps often involve simple evidence-gathering, where correct actions can be predicted without full reasoning. Building on these observations, we present SPAgent, an algorithm-system co-design framework that expands the role of speculation in search agents to reduce latency. Algorithmically, SPAgent introduces a two-phase adaptive speculation mechanism that selectively omits verification when it is safe to do so. System-wise, a two-level scheduler regulates speculative requests based on engine load to ensure speculation remains beneficial. We implement SPAgent in real-world systems. Across extensive experimental settings, SPAgent achieves up to $1.65\times$ end-to-end speedup while maintaining the same or even higher accuracy, enabling practical deployment of multi-step search agents.
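The predict-verify idea the abstract contrasts against can be sketched in a few lines. This is a toy illustration, not SPAgent's actual mechanism: the predictor, verifier, and confidence signal below are stand-in functions invented for the example.

```python
def lightweight_predict(step):
    """Cheap guess of the next action, skipping full LLM reasoning."""
    # Assumption: early evidence-gathering steps map to simple searches.
    return ("search", f"evidence for step {step}")

def full_reasoning(step):
    """Expensive serialized LLM reasoning (simulated here)."""
    return ("search", f"evidence for step {step}")

def run_step(step, confident):
    """Speculate the action; run full verification only when unsure."""
    action = lightweight_predict(step)
    if confident:
        return action, False                  # verification omitted entirely
    verified = full_reasoning(step)
    return verified, action != verified       # True if speculation was wrong
```

In the classic predict-verify paradigm every step still pays for `full_reasoning`; the adaptive variant's latency win comes from the `confident` branch, where verification is skipped altogether.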
FIRST: Federated Inference Resource Scheduling Toolkit for Scientific AI Model Access
Tanikanti, Aditya, Côté, Benoit, Guo, Yanfei, Chen, Le, Saint, Nickolaus, Chard, Ryan, Raffenetti, Ken, Thakur, Rajeev, Uram, Thomas, Foster, Ian, Papka, Michael E., Vishwanath, Venkatram
We present the Federated Inference Resource Scheduling Toolkit (FIRST), a framework enabling Inference-as-a-Service across distributed High-Performance Computing (HPC) clusters. FIRST provides cloud-like access to diverse AI models, such as Large Language Models (LLMs), on existing HPC infrastructure. Leveraging Globus Auth and Globus Compute, the system allows researchers to run parallel inference workloads via an OpenAI-compliant API in private, secure environments. This cluster-agnostic API allows requests to be distributed across federated clusters, targeting numerous hosted models. FIRST supports multiple inference backends (e.g., vLLM), auto-scales resources, maintains "hot" nodes for low-latency execution, and offers both high-throughput batch and interactive modes. The framework addresses the growing demand for private, secure, and scalable AI inference in scientific workflows, allowing researchers to generate billions of tokens daily on-premises without relying on commercial cloud infrastructure.
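Because FIRST exposes an OpenAI-compliant API, a client request body follows the standard chat-completions schema. The endpoint URL and model name below are placeholders for illustration, not values documented by FIRST.

```python
import json

# Placeholder endpoint; a real deployment would publish its own URL.
ENDPOINT = "https://hpc.example.org/v1/chat/completions"

def build_request(model, prompt, max_tokens=256):
    """Build a standard OpenAI-style chat-completion request body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    })

body = build_request("example-llm", "Summarize this simulation log.")
```

Any OpenAI-compatible client library could then POST this body to the federated endpoint, with Globus Auth handling authentication.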
CryptGNN: Enabling Secure Inference for Graph Neural Networks
Sen, Pritam, Ma, Yao, Borcea, Cristian
We present CryptGNN, a secure and effective inference solution for third-party graph neural network (GNN) models in the cloud, which are accessed by clients as ML as a service (MLaaS). The main novelty of CryptGNN is its secure message passing and feature transformation layers using distributed secure multi-party computation (SMPC) techniques. CryptGNN protects the client's input data and graph structure from the cloud provider and the third-party model owner, and it protects the model parameters from the cloud provider and the clients. CryptGNN works with any number of SMPC parties, does not require a trusted server, and is provably secure even if P-1 out of P parties in the cloud collude. Theoretical analysis and empirical experiments demonstrate the security and efficiency of CryptGNN.
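A common building block of the distributed SMPC techniques the abstract mentions is additive secret sharing, where a value is split so that all shares are needed to reconstruct it. The sketch below shows that primitive in isolation; it is not CryptGNN's actual protocol.

```python
import random

PRIME = 2**61 - 1  # all arithmetic is done modulo a large prime

def share(secret, n_parties):
    """Split a value into n additive shares; any n-1 shares reveal nothing."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    """Recombine all shares to recover the secret."""
    return sum(shares) % PRIME

# Linear operations (e.g., the sums in message passing) can be computed
# share-wise, so no single party ever sees the underlying features:
a, b = share(10, 3), share(32, 3)
summed = [(x + y) % PRIME for x, y in zip(a, b)]
```

This is why collusion resistance up to P-1 of P parties is the natural security target for such schemes: any strict subset of shares is uniformly random.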
Twill: Scheduling Compound AI Systems on Heterogeneous Mobile Edge Platforms
Taufique, Zain, Vyas, Aman, Miele, Antonio, Liljeberg, Pasi, Kanduri, Anil
Compound AI (cAI) systems chain multiple AI models to solve complex problems. Deploying cAI services on mobile edge platforms poses a significant challenge in scheduling concurrent DNN-transformer inference tasks, which arrive dynamically in an unknown sequence. Existing mobile edge AI inference strategies manage multi-DNN or transformer-only workloads, rely on design-time profiling, and cannot handle the concurrent inference of DNNs and transformers required by cAI systems. In this work, we address the challenge of scheduling cAI systems on heterogeneous mobile edge platforms. We present Twill, a run-time framework that handles concurrent inference requests of cAI workloads through task-affinity-aware cluster mapping and migration, priority-aware task freezing/unfreezing, and Dynamic Voltage/Frequency Scaling (DVFS), while minimizing inference latency within power budgets. We implement and deploy Twill on the Nvidia Jetson Orin NX platform. We evaluate Twill against state-of-the-art edge AI inference techniques over contemporary DNNs and LLMs, reducing inference latency by 54% on average while honoring power budgets. AI applications are rapidly evolving from monolithic models towards Compound Artificial Intelligence (cAI) systems, which integrate multiple task-specific models and components to solve complex problems [1]-[3]. Emerging cAI systems combine Large Language Models (LLMs) with Deep Neural Networks (DNNs) to provide novel services such as conversational language agents [2]-[5], augmented and virtual reality (AR/VR) gear, and interactive autonomous vehicles [6]. In an exemplar cAI system, DNN models (D1: VGG-19 and D2: ResNet-152) are used for image classification and object detection, transformer models (T1: Bert-base and T2: Bert-large) are used for text summarization and classification, and generative transformers (T3: OPT-350M and LLM: Deepseek-R1) are used for reasoning and report generation.
Each model extracts key features from its input and sends the output to subsequent models to perform collaborative tasks. T1, D1, and D2 are independent inference tasks that can run simultaneously, while T2, T3, and the LLM depend on the outputs of other models. We deployed this exemplar cAI system on the Nvidia Jetson Orin NX platform.
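Affinity-aware mapping under a power budget can be sketched with a greedy placement loop. The clusters, power figures, and affinity lists below are invented for illustration; Twill's real policy additionally handles migration, freezing/unfreezing, and DVFS.

```python
# Per-cluster busy power in watts (illustrative numbers only).
CLUSTERS = {"gpu": 15.0, "cpu_big": 8.0, "cpu_little": 3.0}

def map_tasks(tasks, budget_w):
    """Greedily place each task on its most-preferred cluster that fits.

    tasks: list of (name, preference) where preference lists clusters
    best-first according to the task's affinity.
    """
    placement, used = {}, 0.0
    for name, preference in tasks:
        for cluster in preference:
            cost = CLUSTERS[cluster]
            if used + cost <= budget_w:
                placement[name] = cluster
                used += cost
                break
        else:
            placement[name] = None  # no budget left: task stays frozen
    return placement, used
```

A task whose preferred clusters would exceed the budget is marked `None`, which loosely corresponds to the priority-aware freezing the abstract describes.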
Semi-Clairvoyant Scheduling of Speculative Decoding Requests to Minimize LLM Inference Latency
Li, Ruixiao, Chen, Fahao, Li, Peng
Speculative decoding accelerates Large Language Model (LLM) inference by employing a small speculative model (SSM) to generate multiple candidate tokens and verify them using the LLM in parallel. This technique has been widely integrated into LLM inference serving systems. However, inference requests typically exhibit uncertain execution time, which poses a significant challenge for efficient request scheduling in these systems. Existing work estimates execution time based solely on predicted output length, which can be inaccurate because execution time depends on both output length and the token acceptance rate of verification by the LLM. In this paper, we propose a semi-clairvoyant request scheduling algorithm called Least-Attained/Perceived-Service for Speculative Decoding (LAPS-SD). Given a number of inference requests, LAPS-SD can effectively minimize average inference latency by adaptively scheduling requests according to their features during decoding. When the token acceptance rate is dynamic and execution time is difficult to estimate, LAPS-SD maintains multiple priority queues and allows request execution preemption across different queues. Once the token acceptance rate becomes stable, LAPS-SD can accurately estimate the execution time and schedule requests accordingly. Extensive experiments show that LAPS-SD reduces inference latency by approximately 39\% compared to state-of-the-art scheduling methods.
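The multi-queue, preemptive phase of such scheduling resembles a classic multi-level feedback queue: requests with unknown execution time start at the highest priority and are demoted after each time slice, so short jobs finish early without knowing lengths in advance. The sketch below illustrates that least-attained-service idea; it is not the LAPS-SD algorithm itself, and the quantum and level counts are arbitrary.

```python
from collections import deque

def mlfq_schedule(requests, quantum=2, levels=3):
    """requests: {name: remaining_work}. Returns order of completion."""
    queues = [deque() for _ in range(levels)]
    for name in requests:
        queues[0].append(name)          # everyone starts at top priority
    finished = []
    while any(queues):
        level = next(i for i, q in enumerate(queues) if q)
        name = queues[level].popleft()
        requests[name] -= quantum       # run one time slice
        if requests[name] <= 0:
            finished.append(name)       # short jobs complete early
        else:
            queues[min(level + 1, levels - 1)].append(name)  # demote
    return finished
```

Once acceptance rates stabilize and execution times become estimable, a scheduler can switch from this blind policy to shortest-remaining-time-style ordering, which is the "semi-clairvoyant" transition the abstract describes.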
Splitwiser: Efficient LM inference with constrained resources
Aali, Asad, Cardoza, Adney, Capo, Melissa
Efficient inference of LLMs remains a crucial challenge, with two main phases: a compute-intensive prompt computation and a memory-intensive token generation. Despite existing batching and scheduling techniques, token generation phases fail to fully utilize compute resources, especially when compared to prompt computation phases. To address these challenges, we propose Splitwiser, a methodology that splits the two phases of an LLM inference request onto the same GPU, thereby reducing overhead and improving memory access and cache utilization. By eliminating the need to transfer data across devices, Splitwiser aims to minimize network-related overheads. In this report, we describe the basic structure of our proposed pipeline while sharing preliminary results and analysis. We implement our proposed multiprocessing design on two widely used and independent LLM inference frameworks: Huggingface and vLLM. Generative Large Language Models (LLMs) have become essential in computing, offering vast capabilities in natural language processing. However, their widespread adoption has led to challenges, particularly in inference efficiency.
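The intuition for co-locating the two phases can be shown with back-of-envelope arithmetic: prompt computation is compute-bound and token generation is memory-bound, so overlapping them hides part of the cost. All numbers and the overlap fraction below are made up for illustration, not Splitwiser measurements.

```python
def serialized(requests):
    """Makespan when prompt and decode phases run strictly back-to-back."""
    return sum(p + d for p, d in requests)

def overlapped(requests, overlap=0.6):
    """Assume a fraction of each decode phase hides under the next prompt."""
    total = serialized(requests)
    hidden = sum(d for _, d in requests[:-1]) * overlap
    return total - hidden

# (prompt_ms, decode_ms) per request; illustrative values.
reqs = [(4.0, 6.0), (4.0, 6.0), (4.0, 6.0)]
```

The larger the memory-bound decode share, the more there is to hide, which matches the abstract's observation that token generation underutilizes compute.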
Scalable Runtime Architecture for Data-driven, Hybrid HPC and ML Workflow Applications
Merzky, Andre, Titov, Mikhail, Turilli, Matteo, Kilic, Ozgur, Wang, Tianle, Jha, Shantenu
Hybrid workflows combining traditional HPC and novel ML methodologies are transforming scientific computing. This paper presents the architecture and implementation of a scalable runtime system that extends RADICAL-Pilot with service-based execution to support AI-out-HPC workflows. Our runtime system enables distributed ML capabilities, efficient resource management, and seamless HPC/ML coupling across local and remote platforms. Preliminary experimental results show that our approach manages concurrent execution of ML models across local and remote HPC/cloud resources with minimal architectural overheads. This lays the foundation for prototyping three representative data-driven workflow applications and executing them at scale on leadership-class HPC platforms.
AOLO: Analysis and Optimization For Low-Carbon Oriented Wireless Large Language Model Services
Wang, Xiaoqi, Du, Hongyang, Gao, Yuehong, Kim, Dong In
Recent advancements in large language models (LLMs) have led to their widespread adoption and large-scale deployment across various domains. However, their environmental impact, particularly during inference, has become a growing concern due to their substantial energy consumption and carbon footprint. Existing research has focused on inference computation alone, overlooking the analysis and optimization of the carbon footprint in network-aided LLM service systems. To address this gap, we propose AOLO, a framework for analysis and optimization for low-carbon oriented wireless LLM services. AOLO introduces a comprehensive carbon footprint model that quantifies greenhouse gas emissions across the entire LLM service chain, including computational inference and wireless communication. Furthermore, we formulate an optimization problem aimed at minimizing the overall carbon footprint, which is solved through joint optimization of inference outputs and transmit power under quality-of-experience and system performance constraints. To achieve this joint optimization, we leverage the energy efficiency of spiking neural networks (SNNs) by adopting an SNN as the actor network and propose a low-carbon-oriented optimization algorithm, i.e., SNN-based deep reinforcement learning (SDRL). Comprehensive simulations demonstrate that the SDRL algorithm significantly reduces the overall carbon footprint, achieving an 18.77% reduction compared to the benchmark soft actor-critic, highlighting its potential for enabling more sustainable LLM inference services.
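A service-chain carbon model of the kind the abstract describes sums inference energy and wireless transmission energy, each scaled by grid carbon intensity. The sketch below captures that structure with invented constants; it is not AOLO's actual model.

```python
CARBON_INTENSITY = 0.4  # kgCO2 per kWh (assumed grid average, illustrative)

def inference_energy_kwh(tokens, joules_per_token=0.3):
    """Energy of generating `tokens` output tokens, converted J -> kWh."""
    return tokens * joules_per_token / 3.6e6

def transmit_energy_kwh(bits, tx_power_w, rate_bps):
    """Time-on-air energy of sending the response over the wireless link."""
    return tx_power_w * (bits / rate_bps) / 3.6e6

def total_carbon_kg(tokens, bits, tx_power_w, rate_bps):
    """End-to-end footprint: compute emissions plus communication emissions."""
    return CARBON_INTENSITY * (
        inference_energy_kwh(tokens)
        + transmit_energy_kwh(bits, tx_power_w, rate_bps)
    )
```

Both terms grow with the response, so jointly shrinking output length and transmit power, subject to quality-of-experience floors, is exactly the trade-off the SDRL optimizer navigates.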
HiDP: Hierarchical DNN Partitioning for Distributed Inference on Heterogeneous Edge Platforms
Taufique, Zain, Vyas, Aman, Miele, Antonio, Liljeberg, Pasi, Kanduri, Anil
Edge inference techniques partition and distribute Deep Neural Network (DNN) inference tasks among multiple edge nodes for low-latency inference, without considering the core-level heterogeneity of edge nodes. Further, default DNN inference frameworks do not fully utilize the resources of heterogeneous edge nodes, resulting in higher inference latency. In this work, we propose a hierarchical DNN partitioning strategy (HiDP) for distributed inference on heterogeneous edge nodes. Our strategy hierarchically partitions DNN workloads at both global and local levels by considering the core-level heterogeneity of edge nodes. We evaluated our proposed HiDP strategy against relevant distributed inference techniques over widely used DNN models on commercial edge devices. On average, our strategy achieved 38% lower latency, 46% lower energy, and 56% higher throughput in comparison with other relevant approaches.
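Two-level partitioning can be sketched as a proportional split applied twice: first across nodes by node capacity (global level), then across each node's heterogeneous cores (local level). The capacities below are invented for illustration; HiDP's actual partitioner considers more than raw capacity ratios.

```python
def proportional_split(work, capacities):
    """Divide `work` units in proportion to each entry's capacity."""
    total = sum(capacities.values())
    return {name: work * cap / total for name, cap in capacities.items()}

# Illustrative relative capacities, not real device measurements.
nodes = {"nodeA": 4.0, "nodeB": 2.0}
cores = {"nodeA": {"big": 3.0, "little": 1.0},
         "nodeB": {"big": 1.0, "little": 1.0}}

# Global level: split the whole workload (e.g., layers or FLOPs) across nodes.
global_parts = proportional_split(120, nodes)
# Local level: split each node's slice across its own core types.
local_parts = {n: proportional_split(w, cores[n])
               for n, w in global_parts.items()}
```

The key point is that the local split differs per node, which is precisely the core-level heterogeneity that flat, node-level partitioners ignore.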
SuffixDecoding: A Model-Free Approach to Speeding Up Large Language Model Inference
Oliaro, Gabriele, Jia, Zhihao, Campos, Daniel, Qiao, Aurick
We present SuffixDecoding, a novel model-free approach to accelerating large language model (LLM) inference through speculative decoding. Unlike existing methods that rely on draft models or specialized decoding heads, SuffixDecoding leverages suffix trees built from previously generated outputs to efficiently predict candidate token sequences. Our approach enables flexible tree-structured speculation without the overhead of maintaining and orchestrating additional models. SuffixDecoding builds and dynamically updates suffix trees to capture patterns in the generated text, using them to construct speculation trees through a principled scoring mechanism based on empirical token frequencies. SuffixDecoding requires only CPU memory, which is plentiful and underutilized on typical LLM serving nodes. We demonstrate that SuffixDecoding achieves competitive speedups compared to model-based approaches across diverse workloads including open-domain chat, code generation, and text-to-SQL tasks. For open-ended chat and code generation tasks, SuffixDecoding achieves up to $1.4\times$ higher output throughput than SpecInfer and up to $1.1\times$ lower time-per-token (TPOT) latency. For a proprietary multi-LLM text-to-SQL application, SuffixDecoding achieves up to $2.9\times$ higher output throughput and $3\times$ lower latency than speculative decoding. Our evaluation shows that SuffixDecoding maintains high acceptance rates even with small reference corpora of 256 examples, while continuing to improve performance as more historical outputs are incorporated.
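The model-free idea can be illustrated with a simplified version: index previous outputs by context and speculate the most frequent continuation. The real method uses suffix trees and tree-structured speculation; this flat context-to-frequency map is only a minimal stand-in.

```python
from collections import defaultdict, Counter

def build_index(corpus, context=2):
    """Map each `context`-token window in past outputs to next-token counts."""
    index = defaultdict(Counter)
    for tokens in corpus:
        for i in range(len(tokens) - context):
            key = tuple(tokens[i:i + context])
            index[key][tokens[i + context]] += 1
    return index

def speculate(index, prefix, n_draft=3, context=2):
    """Greedily extend the prefix with the most frequent continuations."""
    draft = list(prefix)
    for _ in range(n_draft):
        candidates = index.get(tuple(draft[-context:]))
        if not candidates:
            break  # no past evidence for this context: stop speculating
        draft.append(candidates.most_common(1)[0][0])
    return draft[len(prefix):]
```

The drafted tokens would then be verified by the LLM in one parallel pass, as in ordinary speculative decoding, but with no draft model to host: the index lives entirely in CPU memory.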